HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages

Wang, Zhilin, Zeng, Jiaqi, Delalleau, Olivier, Shin, Hoo-Chang, Soares, Felipe, Bukharin, Alexander, Evans, Ellie, Dong, Yi, Kuchaiev, Oleksii

arXiv.org Artificial Intelligence

Preference datasets are essential for training general-domain, instruction-following language models with Reinforcement Learning from Human Feedback (RLHF). Each subsequent data release raises expectations for future data collection, meaning there is a constant need to advance the quality and diversity of openly available preference data. To address this need, we introduce HelpSteer3-Preference, a permissively licensed (CC-BY-4.0), high-quality, human-annotated preference dataset comprising over 40,000 samples. These samples span diverse real-world applications of large language models (LLMs), including tasks relating to STEM, coding, and multilingual scenarios. Using HelpSteer3-Preference, we train Reward Models (RMs) that achieve top performance on RM-Bench (82.4%) and JudgeBench (73.7%). This represents a substantial improvement (~10% absolute) over the previously best-reported results from existing RMs. We also demonstrate that HelpSteer3-Preference can be applied to train Generative RMs, and show how policy models can be aligned with RLHF using our RMs. Dataset (CC-BY-4.0): https://huggingface.co/datasets/nvidia/HelpSteer3#preference Models (NVIDIA Open Model): https://huggingface.co/collections/nvidia/reward-models-68377c5955575f71fcc7a2a3
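The abstract does not detail the training objective behind these RMs; as a minimal sketch, reward models trained on preference pairs like HelpSteer3-Preference commonly use the Bradley-Terry pairwise loss, which pushes the score of the human-preferred response above the rejected one. The function name below is illustrative, not taken from the paper.

```python
import math

def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).

    Minimizing this pushes the reward assigned to the human-preferred
    response above the reward assigned to the rejected response.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x))
    return math.log1p(math.exp(-margin))

# A correctly ordered pair incurs a smaller loss than a mis-ordered one.
well_ordered = bradley_terry_loss(2.0, 0.0)
mis_ordered = bradley_terry_loss(0.0, 2.0)
```

In practice the two rewards come from the same model scoring both responses to one prompt, and the loss is averaged over a batch of annotated pairs.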


... incorporating the reviewers' suggestions. Response to Reviewer #1. Comment 1: "The significance of the proposed method is not very clear"

Neural Information Processing Systems

We greatly appreciate the reviewers' effort and helpful comments. Comment 1: "The significance of the proposed method is not very clear..." It also has great theoretical significance in the optimization area. Though the convergence rate of this method could be suboptimal, it's a practical way to ... In addition, [6] shows some examples of saddle-point algorithms where projection onto the constraint sets is hard. Comment 2: "Why do we consider the nuclear norm constraint for this classification problem?" We find that this paper does not have sections 5.4 and 5.6.


Overall Response: We sincerely thank all the reviewers for their helpful comments and constructive suggestions

Neural Information Processing Systems

We sincerely thank all the reviewers for their helpful comments and constructive suggestions. Actually, the motivation for using attention and distillation differs from their origins. Regarding the comparison with related work: ...Net [C2] and NestedNet [C3] (results taken from their papers). The proposed method is robust to hyper-parameters.



We provide two responses to the common concerns raised by the reviewers, and then reply to each reviewer individually

Neural Information Processing Systems

We would like to thank all the reviewers for your helpful comments and suggestions. As shown in Appendix A.3, the layer-wise GCN network has the highest computational complexity in the computational propagation flow. Please see the response in Common Response 2. For a fair comparison, we only report the result on the semi-supervised task. Please see the response in Common Response 2.


R3: Robust Rubric-Agnostic Reward Models

Anugraha, David, Tang, Zilu, Miranda, Lester James V., Zhao, Hanyang, Farhansyah, Mohammad Rifqi, Kuwanto, Garry, Wijaya, Derry, Winata, Genta Indra

arXiv.org Artificial Intelligence

Reward models are essential for aligning language model outputs with human preferences, yet existing approaches often lack both controllability and interpretability. These models are typically optimized for narrow objectives, limiting their generalizability to broader downstream tasks. Moreover, their scalar outputs are difficult to interpret without contextual reasoning. To address these limitations, we introduce R3, a novel reward modeling framework that is rubric-agnostic, generalizable across evaluation dimensions, and provides interpretable, reasoned score assignments. R3 enables more transparent and flexible evaluation of language models, supporting robust alignment with diverse human values and use cases. Our models, data, and code are available as open source at https://github.com/rubricreward/r3.
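The abstract does not reproduce R3's exact prompting format; the following is a generic sketch of how a rubric-agnostic generative judge is typically driven: the rubric is supplied at evaluation time, the judge is asked to reason before scoring, and a score is parsed from its free-form output. All names here are illustrative assumptions.

```python
import re
from typing import Optional

def build_rubric_prompt(rubric: str, prompt: str, response: str) -> str:
    """Compose an evaluation prompt that asks a generative judge to
    reason against an arbitrary, caller-supplied rubric before scoring."""
    return (
        f"Rubric: {rubric}\n"
        f"User prompt: {prompt}\n"
        f"Model response: {response}\n"
        "First explain your reasoning, then end with 'Score: <1-5>'."
    )

def parse_score(judge_output: str) -> Optional[int]:
    """Extract the final 1-5 integer score from the judge's free-form text."""
    match = re.search(r"Score:\s*([1-5])\b", judge_output)
    return int(match.group(1)) if match else None
```

Because the rubric is an input rather than baked into training, the same judge can score helpfulness, safety, or factuality without retraining, which is the controllability the abstract highlights.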


The Pursuit of Empathy: Evaluating Small Language Models for PTSD Dialogue Support

BN, Suhas, Mahajan, Yash, Mattioli, Dominik, Sherrill, Andrew M., Arriaga, Rosa I., Wiese, Chris W., Abdullah, Saeed

arXiv.org Artificial Intelligence

This paper investigates the capacity of small language models (0.5B-5B parameters) to generate empathetic responses for individuals with PTSD. We introduce Trauma-Informed Dialogue for Empathy (TIDE), a novel dataset comprising 10,000 two-turn conversations across 500 diverse, clinically-grounded PTSD personas (https://huggingface.co/datasets/yenopoya/TIDE). Using frontier model outputs as ground truth, we evaluate eight small LLMs in zero-shot settings and after fine-tuning. Fine-tuning enhances empathetic capabilities, improving cosine similarity and perceived empathy, although gains vary across emotional scenarios and smaller models exhibit a "knowledge transfer ceiling." As expected, Claude Sonnet 3.5 consistently outperforms all models, but surprisingly, the smaller models often approach human-rated empathy levels. Demographic analyses showed that older adults favored responses that validated distress before offering support (p = .004), while graduate-educated users preferred emotionally layered replies in specific scenarios. Gender-based differences were minimal (p > 0.15), suggesting the feasibility of broadly empathetic model designs. This work offers insights into building resource-efficient, emotionally intelligent systems for mental health support.
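The abstract does not name the embedding model behind its cosine-similarity metric; under that assumption, scoring a small model's reply against a frontier-model reference reduces to comparing their embedding vectors, as in this minimal sketch.

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors; used here to
    score a small model's reply against a frontier-model reference.
    Returns 1.0 for identical directions, 0.0 for orthogonal ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

In an evaluation loop, both the candidate reply and the ground-truth reply would first be embedded with the same sentence encoder before this comparison.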


Curriculum-RLAIF: Curriculum Alignment with Reinforcement Learning from AI Feedback

Li, Mengdi, Lin, Jiaye, Zhao, Xufeng, Lu, Wenhao, Zhao, Peilin, Wermter, Stefan, Wang, Di

arXiv.org Artificial Intelligence

Reward models trained with conventional Reinforcement Learning from AI Feedback (RLAIF) methods suffer from limited generalizability, which hinders the alignment performance of the policy model during reinforcement learning (RL). This challenge stems from various issues, including distribution shift, preference label noise, and mismatches between overly challenging samples and model capacity. In this paper, we attempt to enhance the generalizability of reward models through a data-centric approach, driven by the insight that these issues are inherently intertwined from the perspective of data difficulty. To address this, we propose a novel framework, Curriculum-RLAIF, which constructs preference pairs with varying difficulty levels and produces a curriculum that progressively incorporates preference pairs of increasing difficulty for reward model training. Our experimental results suggest that reward models trained with Curriculum-RLAIF achieve improved generalizability, significantly increasing the alignment performance of the policy model without incurring additional inference costs compared to various non-curriculum baselines. Detailed analysis and comparisons with alternative approaches, including data selection via external pretrained reward models or internal self-selection mechanisms, as well as other curriculum strategies, further demonstrate the superiority of our approach in terms of simplicity, efficiency, and effectiveness.
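The paper's actual difficulty measure is not given in the abstract; as an illustrative assumption, one common proxy is the score margin between the chosen and rejected responses (a small margin means a harder pair). The sketch below builds cumulative easy-to-hard stages from any such caller-supplied difficulty function.

```python
def build_curriculum(pairs, difficulty, num_stages=3):
    """Sort preference pairs from easy to hard and split them into
    cumulative stages for reward-model training.

    `difficulty` maps a pair to a scalar (larger = harder). Each stage
    keeps all earlier, easier pairs so training never forgets them.
    """
    ordered = sorted(pairs, key=difficulty)
    stage_size = max(1, len(ordered) // num_stages)
    stages = []
    for i in range(num_stages):
        end = len(ordered) if i == num_stages - 1 else (i + 1) * stage_size
        stages.append(ordered[:end])  # cumulative prefix of easiest pairs
    return stages
```

Training then proceeds stage by stage, so the reward model first fits clean, well-separated pairs before being exposed to noisy or borderline ones.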